Memo 3 The Ultrascalar Processor An Asymptotically Scalable
Abstract
Today’s superscalar processors rename registers, bypass registers, checkpoint state so that they can recover from speculative execution, check for dependencies, allocate execution units, and access multi-ported register files. The circuits employed are complex and irregular, requiring much effort and ingenuity to implement well. Furthermore, the delays through many of the circuits grow quadratically with issue width (the maximum number of simultaneously fetched or issued instructions) and window size (the maximum number of instructions within the processor core), making future scaling of today’s designs problematic [11, 4, 5]. With billion-transistor chips on the horizon, this scalability barrier appears to be one of the most serious obstacles for high-performance uniprocessors in the next decade.
Surprisingly, it is possible to extract the same instruction-level parallelism (ILP) with a regular circuit structure that has only logarithmic gate delay and linear wire delay (speed-of-light delay), or even sublinear wire delay, depending on how much memory bandwidth the processor requires. This paper describes a new processor microarchitecture, called the Ultrascalar processor, based on such a circuit structure.
The goal of this paper is to illustrate that processors can scale well with issue width and window size. We have designed a new microarchitecture and laid out its datapath. We have analyzed its asymptotic growth and empirically computed its area and critical-path delays for different window sizes. We have not optimized the Ultrascalar architecture to be competitive with today’s designs. Although we outline design choices that could make the Ultrascalar competitive, an optimized processor design is outside the scope of this paper.
This paper also does not evaluate the benefits of larger issue widths and window sizes. Some work has been done showing the advantages of high-issue-width and high-window-size processors. Lam and Wilson suggest that ILP of ten to twenty is available with an infinite instruction window and good branch prediction [7]. Patel, Evers and Patt demonstrate significant parallelism for a 16-wide machine given a good trace cache [13]. Patt et al. argue that a window size in the thousands is the best way to use large chips [14]. The amount of parallelism available in, for example, a thousand-wide instruction window with realistic branch prediction is, however, not well understood. The ultimate value of the Ultrascalar microarchitecture will depend on careful engineering for specific window sizes and on the available parallelism in programs.
Related Resources
Ultrascalar Memo 4: A Comparison of Asymptotically Scalable Superscalar Processors
The poor scalability of existing superscalar processors has been of great concern to the computer engineering community. In particular, the critical-path lengths of many components in existing implementations grow as Θ(n²), where n is the fetch width, the issue width, or the window size. This paper describes two scalable processor architectures, the Ultrascalar I and the Ultrascalar II, and compares the...
The Ultrascalar Processor: An Asymptotically Scalable Superscalar Microarchitecture
The poor scalability of existing superscalar processors has been of great concern to the computer engineering community. In particular, the critical-path lengths of many components in existing implementations grow as Θ(n²), where n is the fetch width, the issue width, or the window size. This paper presents a novel implementation, called the Ultrascalar processor, that dramatically reduces the asy...
A Formal Verification of Ultrascalar Processor using Term Rewriting Systems
Today’s microprocessors implement increasingly complex microarchitectures to achieve high performance. As complexity increases, understanding the semantics of the instructions becomes more difficult. As part of the processor design chain, verifying the correctness of a specification is very important. A recent approach [1] [5] [6] is to describe the system and its components as terms generate...
Randomized Fully-Scalable BSP Techniques for Multi-Searching and Convex Hull Construction (Preliminary Version)
We study randomized techniques for designing efficient algorithms on a p-processor bulk-synchronous parallel (BSP) computer, which is a parallel multicomputer that allows for general processor-to-processor communication rounds provided each processor is guaranteed to send and receive at most h items in any round. The measure of efficiency we use is in terms of the internal computation time of the p...
Randomized Fully-Scalable BSP
We study randomized techniques for designing efficient algorithms on a p-processor bulk-synchronous parallel (BSP) computer, which is a parallel multicomputer that allows for general processor-to-processor communication rounds provided each processor is guaranteed to send and receive at most h items in any round. The measure of efficiency we use is in terms of the internal computation time of the p...